Maddison Database Project

The goal in this notebook is to replicate the regional aggregations of GDP available in the Maddison Database Project v2020.

Three files are processed first: two from the official release of the Maddison Database Project (full data and regional data) and a regional composition file provided by Jutta Bolt.

This is the structure of the Maddison Database:

The data is also available in OWID's catalog, where country and regional data are integrated into one dataset, the country code variable is dropped and a GDP column is added.

The regional aggregations are extracted from the catalog dataset to compare them with the reconstructed aggregations.

From the regional composition file, the countries belonging to each region in each year are extracted and merged with the country data.

Only non-null regions and estimates are kept.
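The merge and filtering steps described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names (region, country, year, gdp_pc, population) are assumptions, and the numbers are invented toy values rather than Maddison figures.

```python
import pandas as pd

# Hypothetical regional composition: which countries form a region in each year.
composition = pd.DataFrame({
    "region": ["East Asia", "East Asia", "East Asia", "East Asia"],
    "country": ["China", "Japan", "China", "Japan"],
    "year": [1920, 1920, 1950, 1950],
})

# Hypothetical country data; China has no row for 1920 (toy values).
countries = pd.DataFrame({
    "country": ["Japan", "China", "Japan"],
    "year": [1920, 1950, 1950],
    "gdp_pc": [2000.0, 800.0, 3000.0],
    "population": [55_000.0, 550_000.0, 83_000.0],
})

# A left merge keeps every (region, country, year) from the composition file,
# leaving NaN where the country has no estimate for that year.
merged = composition.merge(countries, on=["country", "year"], how="left")

# Keep only rows with a region and a GDP pc estimate.
kept = merged.dropna(subset=["region", "gdp_pc"])
print(kept)
```

With a left merge, the rows dropped by the filter show exactly which region members are missing an estimate in a given year.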

From this the reconstructed aggregations are made, strictly using the countries and years defined in the regional composition file.
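One plausible way to build the regional values is a population-weighted aggregation: sum GDP and population per region and year, then divide. The source does not state the exact formula, so this is an assumption, as are the column names and the toy values below.

```python
import pandas as pd

# Toy country rows already tagged with their region (invented values).
merged = pd.DataFrame({
    "region": ["Region A", "Region A", "Region B"],
    "country": ["X", "Y", "Z"],
    "year": [1950, 1950, 1950],
    "gdp_pc": [1000.0, 3000.0, 2000.0],
    "population": [300.0, 100.0, 50.0],
})

# Total GDP per country is GDP per capita times population.
merged["gdp"] = merged["gdp_pc"] * merged["population"]

# Sum GDP and population within each region and year.
reconstructed = merged.groupby(["region", "year"], as_index=False)[
    ["gdp", "population"]
].sum()

# Regional GDP pc is total GDP over total population.
reconstructed["gdp_pc"] = reconstructed["gdp"] / reconstructed["population"]
print(reconstructed)
```

Summing GDP and population separately avoids taking a simple mean of per capita values, which would give small and large countries the same weight.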

The comparison dataframe is a join between the official and reconstructed regional data, where a ratio variable is introduced as the reconstructed GDP pc value divided by the official one. Consequently, if both values are equal, this ratio is 1.
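A minimal sketch of this comparison, assuming both tables share region and year keys and a gdp_pc column (hypothetical names, invented values):

```python
import pandas as pd

# Hypothetical official and reconstructed regional values.
official = pd.DataFrame({
    "region": ["Region A", "Region B"],
    "year": [1950, 1950],
    "gdp_pc": [1500.0, 2500.0],
})
reconstructed = pd.DataFrame({
    "region": ["Region A", "Region B"],
    "year": [1950, 1950],
    "gdp_pc": [1500.0, 2000.0],
})

# Join on region and year; suffixes distinguish the two GDP pc columns.
comparison = official.merge(
    reconstructed,
    on=["region", "year"],
    suffixes=("_official", "_reconstructed"),
)

# Ratio of reconstructed to official: 1 means the two values agree.
comparison["ratio"] = (
    comparison["gdp_pc_reconstructed"] / comparison["gdp_pc_official"]
)
print(comparison)
```

Here Region A gets a ratio of 1 and Region B a ratio below 1, meaning the reconstruction falls short of the official value.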

This ratio is close to 1 on average, but there are differences between the recalculation and the official data, which can be seen in this graph:

Two of the clearest differences are for East Asia in 1940 (ratio 3.6) and 1920 (ratio 2.56), which means the reconstruction is more than three and more than two times the official estimates, respectively. For the case of 1920:

The big difference is explained by the lack of GDP pc data for China in 1920. East Asia in 1920 is defined as Japan and China:
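To see why a missing country can inflate the ratio, here is a small illustration with made-up numbers (not Maddison values), assuming a population-weighted regional GDP pc. When the larger, poorer country drops out, the regional value collapses to that of the richer one.

```python
import pandas as pd

# Invented values for a two-country region.
east_asia = pd.DataFrame({
    "country": ["Japan", "China"],
    "gdp_pc": [2000.0, 500.0],
    "population": [50.0, 450.0],
})


def weighted_gdp_pc(df: pd.DataFrame) -> float:
    """Population-weighted GDP pc: total GDP divided by total population."""
    return (df["gdp_pc"] * df["population"]).sum() / df["population"].sum()


# Regional value when both countries have data.
with_china = weighted_gdp_pc(east_asia)

# Regional value when China is missing and only Japan is aggregated.
japan_only = weighted_gdp_pc(east_asia[east_asia["country"] == "Japan"])

ratio = japan_only / with_china
print(with_china, japan_only, ratio)
```

In this toy case the region is worth 650 with both countries and 2000 with Japan alone, so the ratio is about 3.08.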

But this is not the only explanation for the differences: Eastern Europe in 1950, for example, is composed of five countries (Bulgaria, Czechoslovakia, Hungary, Poland, Romania), and they all have data for 1950. The ratio is nevertheless 0.82.

What is the reason for the difference, then? For v2020 there is no information about how the regional aggregates are calculated, but at least for v2018 there is a regional data file which shows this: image.png

It is then necessary to estimate GDP growth values and apply them to the 1950 estimates. The variable growth is the difference between a country's GDP pc in one year and the same country's GDP pc in the previous year.
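A sketch of how such a growth variable could be computed per country. The text describes growth as a difference; since it is not clear whether an absolute or a relative change is meant, both are shown. Column names and values are hypothetical, and the rows are assumed to cover consecutive years for each country.

```python
import pandas as pd

# Toy country series with consecutive years (invented values).
df = pd.DataFrame({
    "country": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [1948, 1949, 1950, 1948, 1949, 1950],
    "gdp_pc": [100.0, 110.0, 121.0, 200.0, 210.0, 231.0],
})

# Sort so that the previous row within each country is the previous year.
df = df.sort_values(["country", "year"]).reset_index(drop=True)

# Absolute change in GDP pc with respect to the previous year.
df["growth_abs"] = df.groupby("country")["gdp_pc"].diff()

# Relative change (growth rate) with respect to the previous year.
df["growth_rel"] = df.groupby("country")["gdp_pc"].pct_change()
print(df)
```

The first year of each country has no previous value, so both growth columns are NaN there. If the data has gaps between years, the previous row is not the previous calendar year and the series would need reindexing first.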

Some data manipulation is necessary before the MDP data from the catalog can be merged with the other datasets.

This ratio is also close to 1 on average, but there are also differences depending on the region:

For 1940 there are still differences from the official estimates, so this is not the method they use.